This is an R Markdown workbook! It is a way to combine text and plots in one clean, easy-to-read document. Chunks like this one compile as text! Chunks between back ticks (like below) are passed to R. Lines preceded by a hash mark are presented as section headers.
This is the completed version of the workbook. The student workbook has the same code with some stuff removed and replaced with ??, to encourage active learning.
## ^ you can make the code chunk appear in the document by flagging as include = T
### use lexical decision data from Harald Baayen: get it, open the help file for it, and look at it
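library(languageR) # provides the lexdec data
library(ggplot2)   # plotting
library(dplyr)     # %>%, group_by, summarise
data(lexdec)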
?lexdec
## starting httpd help server ... done
summary(lexdec)
##    Subject           RT            Trial       Sex     NativeLanguage
##  A1     :  79   Min.   :5.829   Min.   : 23   F:1106   English:948
##  A2     :  79   1st Qu.:6.215   1st Qu.: 64   M: 553   Other  :711
##  A3     :  79   Median :6.346   Median :106
##  C      :  79   Mean   :6.385   Mean   :105
##  D      :  79   3rd Qu.:6.502   3rd Qu.:146
##  I      :  79   Max.   :7.587   Max.   :185
##  (Other):1185
##     Correct        PrevType      PrevCorrect          Word
##  correct  :1594   nonword:855   correct  :1542   almond   :  21
##  incorrect:  65   word   :804   incorrect: 117   ant      :  21
##                                                  apple    :  21
##                                                  apricot  :  21
##                                                  asparagus:  21
##                                                  avocado  :  21
##                                                  (Other)  :1533
##    Frequency       FamilySize      SynsetCount         Length
##  Min.   :1.792   Min.   :0.0000   Min.   :0.6931   Min.   : 3.000
##  1st Qu.:3.951   1st Qu.:0.0000   1st Qu.:1.0986   1st Qu.: 5.000
##  Median :4.754   Median :0.0000   Median :1.0986   Median : 6.000
##  Mean   :4.751   Mean   :0.7028   Mean   :1.3154   Mean   : 5.911
##  3rd Qu.:5.652   3rd Qu.:1.0986   3rd Qu.:1.6094   3rd Qu.: 7.000
##  Max.   :7.772   Max.   :3.3322   Max.   :2.3026   Max.   :10.000
##
##    Class       FreqSingular     FreqPlural      DerivEntropy
##  animal:924   Min.   :   4.0   Min.   :  0.0   Min.   :0.0000
##  plant :735   1st Qu.:  23.0   1st Qu.: 19.0   1st Qu.:0.0000
##               Median :  69.0   Median : 49.0   Median :0.0370
##               Mean   : 132.1   Mean   :109.7   Mean   :0.3856
##               3rd Qu.: 146.0   3rd Qu.:132.0   3rd Qu.:0.6845
##               Max.   :1518.0   Max.   :854.0   Max.   :2.2641
##
##    Complex           rInfl           meanRT         SubjFreq
##  complex: 210   Min.   :-1.3437   Min.   :6.245   Min.   :2.000
##  simplex:1449   1st Qu.:-0.3023   1st Qu.:6.322   1st Qu.:3.160
##                 Median : 0.1900   Median :6.364   Median :3.880
##                 Mean   : 0.2845   Mean   :6.379   Mean   :3.911
##                 3rd Qu.: 0.6385   3rd Qu.:6.420   3rd Qu.:4.680
##                 Max.   : 4.4427   Max.   :6.621   Max.   :6.040
##
##    meanSize        meanWeight           BNCw              BNCc
##  Min.   :1.323   Min.   :0.8244   Min.   : 0.02229   Min.   : 0.0000
##  1st Qu.:1.890   1st Qu.:1.4590   1st Qu.: 1.64921   1st Qu.: 0.1625
##  Median :3.099   Median :2.7558   Median : 3.32071   Median : 0.6500
##  Mean   :2.891   Mean   :2.5516   Mean   : 7.37800   Mean   : 5.0351
##  3rd Qu.:3.711   3rd Qu.:3.4178   3rd Qu.: 7.10943   3rd Qu.: 2.9248
##  Max.   :4.819   Max.   :4.7138   Max.   :79.17324   Max.   :83.1949
##
##       BNCd            BNCcRatio        BNCdRatio
##  Min.   :  0.000   Min.   :0.00000   Min.   :0.0000
##  1st Qu.:  1.188   1st Qu.:0.09673   1st Qu.:0.5551
##  Median :  3.800   Median :0.27341   Median :0.9349
##  Mean   : 12.995   Mean   :0.45834   Mean   :1.5428
##  3rd Qu.: 10.451   3rd Qu.:0.55550   3rd Qu.:2.1315
##  Max.   :241.561   Max.   :8.29545   Max.   :6.3458
##
## and once you run a chunk that includes output, you can minimise it again by hitting the double-arrow in the top right.
Ins and outs of ggplot: building a plot from the ground up.
An idea borrowed from http://joeystanley.com/blog/making-vowel-plots-in-r-part-1
Let’s build a ggplot object from the beginning…
#open a plot window
ggplot()
Add data to it! Still no obvious output.
ggplot(data=lexdec)
To get axes on the plot, we tell ggplot which variables to use with aes (aesthetics) arguments.
As a shorthand, the first argument is read as the data, so you don’t have to name it.
This plot window now has X and Y axes.
ggplot(lexdec,aes(y=RT,x=Frequency))
We add things to the plot with ‘geom’ functions, which are literally added on with +.
ggplot(lexdec,aes(y=RT,x=Frequency))+
geom_point()
We can also save the object at any stage of the process, and then add to it as we go.
pl1 <- ggplot(lexdec,aes(y=RT,x=Frequency))
pl1 + geom_point()
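Saving works because a ggplot is just an R object: adding a geom gives back a new plot and leaves the saved one alone. Here is a minimal sketch of that idea. It uses a small simulated stand-in called toy (made-up values, not the real lexdec data) so it runs on its own.

```r
library(ggplot2)

# Simulated stand-in for lexdec: values are made up, only the column names match
set.seed(1)
toy <- data.frame(Frequency = runif(100, 2, 8))
toy$RT <- 6.8 - 0.05 * toy$Frequency + rnorm(100, sd = 0.2)

base_plot   <- ggplot(toy, aes(y = RT, x = Frequency))  # axes only
with_points <- base_plot + geom_point()                 # a new object with one layer

length(base_plot$layers)    # layers in the saved object
length(with_points$layers)  # layers after adding geom_point()
```

So you can keep one base object around and try out as many geoms on it as you like.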
Add a smoothed line:
pl1 + geom_smooth(method='lm')
Add points & a smoothed line!
pl1 + geom_point() + geom_smooth(method='lm')
There are many smooth options. Search geom_smooth docs online for details.
Two more:
Loess smooth (a local regression)
pl1 + geom_point() + geom_smooth(method='loess')
Generalized additive model
pl1 + geom_point() + geom_smooth(method='gam')
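If you want to convince yourself what method='lm' is doing, you can compare the drawn line with a model you fit by hand. This is only a sketch on simulated data (the toy data frame is made up); layer_data() is the ggplot2 helper that returns the values a layer actually draws.

```r
library(ggplot2)

# Simulated stand-in for lexdec (made-up values)
set.seed(1)
toy <- data.frame(Frequency = runif(100, 2, 8))
toy$RT <- 6.8 - 0.05 * toy$Frequency + rnorm(100, sd = 0.2)

p <- ggplot(toy, aes(y = RT, x = Frequency)) +
  geom_smooth(method = "lm", formula = y ~ x)

line_df <- layer_data(p, 1)             # x, y, ymin, ymax of the drawn smooth
fit     <- lm(RT ~ Frequency, data = toy)
by_hand <- predict(fit, newdata = data.frame(Frequency = line_df$x))

all.equal(unname(by_hand), line_df$y)   # compare the hand fit with the plotted line
```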
In the plots we made above, there are a lot of points on top of each other!
To fix this, use geom_jitter instead of geom_point. geom_jitter is like geom_point but has parameters for randomly moving points horizontally and vertically.
pl1 + geom_jitter(width=.1,height=.1)
Try with different levels of jitter… This is too much.
pl1 + geom_jitter(width=1,height=1)
Try with adjustment to alpha (point transparency) instead.
pl1 + geom_point(alpha=.1)
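Jitter is random, so the plot looks slightly different every time you knit. One way around that, sketched here on simulated data (toy is made up), is to pass a position_jitter() with a fixed seed to geom_point. You can combine it with alpha too.

```r
library(ggplot2)

# Simulated stand-in for lexdec (made-up values)
set.seed(1)
toy <- data.frame(Frequency = runif(100, 2, 8))
toy$RT <- 6.8 - 0.05 * toy$Frequency + rnorm(100, sd = 0.2)

# Same amount of jitter as above, but reproducible because of the seed
jit <- position_jitter(width = 0.1, height = 0.1, seed = 42)
p <- ggplot(toy, aes(y = RT, x = Frequency)) +
  geom_point(alpha = 0.3, position = jit)

first  <- layer_data(p, 1)
second <- layer_data(p, 1)
identical(first$x, second$x)  # the jittered positions repeat from build to build
p
```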
An excellent feature of ggplot is that you can map any number of aes values at once. Let’s now add to the above plot, with line types for native language and shapes for whether the trial was correct.
We will write this out in full, so that we can specify the aes directly in the ggplot call. I like this plot, but want to add some color to it.
ggplot(lexdec,aes(y=RT,x=Frequency,linetype=NativeLanguage,shape=Correct)) +
geom_point(alpha=.5)+
geom_smooth(method='lm')
Adding color is a good way to make things visually distinct. Here, I’m also suppressing the se bands on the smooths.
pl2 <- ggplot(lexdec,aes(y=RT,x=Frequency, shape=Correct, lty=Correct, color=NativeLanguage)) +
geom_point(alpha=.5)+
geom_smooth(method='lm',se=F)
pl2
Color is also available for things with lots of levels…but it can get hard to read!
pl3 <- ggplot(lexdec,aes(y=RT,x=Frequency,
color=Subject,lty=NativeLanguage)) +
geom_point(alpha=.5)+
geom_smooth(method='lm',se=F)
pl3
We can add panels to a plot with facet_grid or facet_wrap.
pl2 + facet_grid(.~NativeLanguage)
pl2 + facet_grid(NativeLanguage~.)
facet_grid makes a little grid of panels (rows~columns)
pl2 + facet_grid(Correct~NativeLanguage)
facet_wrap wraps around… This is useful for something with lots of levels.
pl3 + facet_wrap(~Subject,ncol=7) ## can specify ncol(umns) and nrow(s)
We can change the labels and scales on x and y axes!
pl2 + scale_x_continuous(name="Word Frequency") +
scale_y_continuous(name="Lexical Decision RT")
pl2+scale_x_continuous(name="Word Frequency",limits=c(1,10))+
scale_y_continuous(name="Lexical Decision RT",limits=c(5.5,8))
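The limits used above happen to contain all of the data. With tighter limits, a scale throws away the points that fall outside (they become missing values, so a smooth would be fit without them), while coord_cartesian() only zooms the view. A small sketch on simulated data (toy is made up) shows the difference:

```r
library(ggplot2)

# Simulated stand-in for lexdec (made-up values)
set.seed(1)
toy <- data.frame(Frequency = runif(100, 2, 8))
toy$RT <- 6.8 - 0.05 * toy$Frequency + rnorm(100, sd = 0.2)

p <- ggplot(toy, aes(y = RT, x = Frequency)) + geom_point()

clipped <- p + scale_x_continuous(limits = c(4, 6))  # out-of-range points are dropped
zoomed  <- p + coord_cartesian(xlim = c(4, 6))       # all points kept, view is cropped

n_clipped <- sum(!is.na(layer_data(clipped, 1)$x))   # points that survive the limits
n_zoomed  <- sum(!is.na(layer_data(zoomed, 1)$x))    # every row of toy
c(clipped = n_clipped, zoomed = n_zoomed)
```

If you only want to look at part of the plot, coord_cartesian() is usually the safer choice.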
We can also add a main title
pl2 +
ggtitle("Lexical decision RT predicted by word frequency and correctness")
## these can include line breaks (\n)
pl2 +
ggtitle("Lexical decision RT \n predicted by word frequency and correctness")
## and we can center them!
pl2 +
ggtitle("Lexical decision RT \n predicted by word frequency and correctness")+
theme(plot.title = element_text(hjust = 0.5))
A bubble plot just changes the size of the points at x,y by some variable z. This can be either continuous or discrete.
Change the point size by morphological family size (which co-varies with word frequency: more frequent words come from more common morphological families). The variable is called FamilySize.
ggplot(lexdec,aes(y=RT,x=Frequency,size=FamilySize))+
geom_point(alpha=.1)
Add a horizontal reference line at the mean RT
mRT <- mean(lexdec$RT)
pl4 <- ggplot(lexdec,aes(y=RT,x=Frequency,size=FamilySize))+
geom_point(alpha=.1)
pl4 + geom_hline(yintercept=mRT,color='red')
Add a vertical reference line at the mean frequency
mF <- mean(lexdec$Frequency)
pl4 + geom_vline(xintercept=mF,color='red')
Color points by whether the RT is above or below the mean
lexdec$RTBin <- as.factor( ifelse(lexdec$RT > mRT, 2,1) )
ggplot(lexdec,aes(y=RT,x=Frequency,size=FamilySize,color=RTBin))+
geom_point(alpha=.1)+
geom_hline(yintercept=mRT,color='red')
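The 1/2 coding works, but the legend will just say 1 and 2. A possible variation, sketched here on made-up RTs (toy is not the real data), is to build the factor with readable labels so the legend explains itself:

```r
# Made-up RTs standing in for lexdec$RT
set.seed(1)
toy <- data.frame(RT = rnorm(50, mean = 6.4, sd = 0.2))

cutoff <- mean(toy$RT)
toy$RTBin <- factor(
  ifelse(toy$RT > cutoff, "above mean", "below mean"),
  levels = c("below mean", "above mean")  # fixes the order in the legend
)
table(toy$RTBin)
```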
Let’s tabulate the data to get average RTs by word.
#Do tabulations using dplyr:
mbyWord <- lexdec %>% group_by(Frequency,Word)%>%
summarise(meanRT=mean(RT))
ggplot(mbyWord,aes(x=Frequency,y=meanRT,label=Word)) +
geom_text()
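Grouping by Frequency and Word together works because each word has exactly one frequency, so Frequency just rides along into the summary table. Below is a small self-contained sketch of the same tabulation. The words are borrowed from the summary above but the frequencies and RTs are invented, and .groups = "drop" assumes dplyr 1.0 or newer.

```r
library(dplyr)
library(ggplot2)

# Invented values: four words, five trials each, one frequency per word
set.seed(1)
toy <- data.frame(
  Word      = rep(c("ant", "apple", "almond", "avocado"), each = 5),
  Frequency = rep(c(5.3, 4.2, 3.1, 2.8), each = 5)
)
toy$RT <- 6.8 - 0.05 * toy$Frequency + rnorm(20, sd = 0.2)

by_word <- toy %>%
  group_by(Word, Frequency) %>%
  summarise(meanRT = mean(RT), .groups = "drop")  # one row per word, ungrouped
by_word

# check_overlap = TRUE skips labels that would print on top of earlier ones
p <- ggplot(by_word, aes(x = Frequency, y = meanRT, label = Word)) +
  geom_text(check_overlap = TRUE)
p
```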
Let’s make a heat map. These need a data frame of x by y, containing z values. Let’s try a different visualization of RT for correct and error trials by participant, split by their native language.
#Do tabulations using dplyr:
corRT <- lexdec %>% group_by(Subject,Correct,NativeLanguage)%>%
summarise(meanRT=mean(RT))
ggplot(corRT,aes(x=Correct,y=Subject)) +
geom_tile(aes(fill=meanRT)) +
facet_grid(~NativeLanguage)
## we can change the color map to be a little more useful
ggplot(corRT,aes(Correct,Subject)) +
geom_tile(aes(fill=meanRT)) +
scale_fill_gradient(low="yellow",high="red")
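Heat maps are often easier to read when the rows are sorted by the value being shown. One way to try that is base R's reorder(), which re-levels a factor by a summary (the mean by default) of another variable. The cell means below are made up for three hypothetical subjects.

```r
library(ggplot2)

# Made-up cell means for three hypothetical subjects
cells <- data.frame(
  Subject = rep(c("S1", "S2", "S3"), each = 2),
  Correct = rep(c("correct", "incorrect"), times = 3),
  meanRT  = c(6.4, 6.6, 6.2, 6.3, 6.5, 6.9)
)

# Order the subject levels by each subject's average meanRT
cells$Subject <- reorder(factor(cells$Subject), cells$meanRT)
levels(cells$Subject)

p <- ggplot(cells, aes(x = Correct, y = Subject)) +
  geom_tile(aes(fill = meanRT)) +
  scale_fill_gradient(low = "yellow", high = "red")
p
```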
Let’s build our own color map that assigns colors to participants based upon their native language. It also serves as a little bit of a coding exercise.
## let's make a new variable that we can use to cluster together all the English / Other subs. Paste together the native language of the person, and their subject ID
lexdec$Subj2 <- paste(lexdec$NativeLanguage,
lexdec$Subject,sep=" ")
## to make a new color palette
## get a list of all English speakers
# & find out how big it is
## there are lots of ways to get this number-- here is one of them!
eng<- lexdec %>% filter(NativeLanguage=="English") %>%
group_by(Subject) %>% summarise()
dim(eng)[1]
## [1] 12
## 12 levels, so we need 12 shades of blue
## I'm asking a function to create 12 divisions of blue colors
## between two named endpoints
## r has lots of named colors!-- see http://www.stat.columbia.edu/~tzheng/files/Rcolor.pdf
## you can also use Hexadecimal colors (like in this cols list)
blues<-colorRampPalette(colors=c("steelblue1","darkblue"))(12)
## now get a list of all other speakers & find out its size
oth <- lexdec %>% filter(NativeLanguage=="Other") %>%
group_by(Subject) %>% summarise()
dim(oth)[1]
## [1] 9
## 9 levels, so we need 9 shades of pink
pinks <- colorRampPalette(colors=c("pink","magenta"))(9)
## concatenate the two sets of colors, making one list
colors <- c(blues,pinks)
## use this as our color palette, with the sorted subject variable
ggplot(lexdec,aes(y=RT,x=Frequency,color=Subj2,
lty=NativeLanguage)) +
geom_point(alpha=.5)+
geom_smooth(method='lm',se=F)+
scale_color_manual(values=colors)
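The palette above relies on the colors being in the same order as the sorted Subj2 values. A slightly more defensive variation is to name each color with the level it belongs to, since scale_color_manual matches a named values vector by name. This is only a sketch: the subject IDs and languages below are invented, and shade_ids is a hypothetical helper, not something from the workbook.

```r
# Invented subjects: IDs and language labels are for illustration only
subj <- data.frame(
  Subject        = c("S1", "S2", "S3", "S4", "S5"),
  NativeLanguage = c("English", "Other", "English", "English", "Other")
)
subj$Subj2 <- paste(subj$NativeLanguage, subj$Subject, sep = " ")

# Hypothetical helper: one shade per id between two named colors, named by id
shade_ids <- function(ids, from, to) {
  setNames(colorRampPalette(c(from, to))(length(ids)), ids)
}

eng_ids <- sort(subj$Subj2[subj$NativeLanguage == "English"])
oth_ids <- sort(subj$Subj2[subj$NativeLanguage == "Other"])

palette_by_subject <- c(shade_ids(eng_ids, "steelblue1", "darkblue"),
                        shade_ids(oth_ids, "pink", "magenta"))
palette_by_subject
# in a plot: scale_color_manual(values = palette_by_subject)
```

This way you never have to count the levels by hand, and the colors stay attached to the right subjects if the data change.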
That legend is very hard to read. Fix it by suppressing the legend for color. We’ll do this in two steps.
pl7 <- ggplot(lexdec,aes(y=RT,x=Frequency,
color=Subj2,lty=NativeLanguage)) +
geom_point(alpha=.5)+
geom_smooth(method='lm',se=F)+
## suppress the legend for color (guide=FALSE is deprecated in newer ggplot2; "none" does the same job)
scale_color_manual(values=colors,guide="none")
pl7
## and now redo the legend for line type to include color
# use guides() with the override.aes argument of guide_legend() to do this
pl7<- pl7 + guides(linetype = guide_legend(override.aes =
list(color = c("blue","magenta")) ))
pl7
The scale_XXX_manual functions can be used to change all kinds of settings, such as legend titles or line types.
pl7 + scale_linetype_manual(name="Native Language",values=c(1,4)) ## these are built-in line types
I also want to change point shapes. Here’s a list of point shape (pch) values: http://www.sthda.com/english/wiki/ggplot2-point-shapes
pl8 <- ggplot(lexdec,aes(y=RT,x=Frequency,color=Subj2,
shape=NativeLanguage)) +
geom_point(alpha=.5)+
geom_smooth(method='lm',se=F)+
facet_grid(~NativeLanguage)+
scale_color_manual(values=colors,guide="none")+
scale_shape_manual(values=c(1,18),guide="none")
pl8
Change some more labels
#set up a plot
pl8b <- ggplot(lexdec,aes(y=RT,x=Frequency,color=Subj2,
shape=NativeLanguage,linetype=NativeLanguage)) +
facet_grid(~NativeLanguage)+
geom_point(alpha=.5)+
geom_smooth(method='lm',se=F)+
scale_color_manual(values=colors,guide="none")
## if we rename the line type legend, we have to give the point shape legend the same name, or the legend splits in two
### compare this...
pl8b + scale_linetype_manual(name="Native Language",values=c(1,4))+
guides(linetype = guide_legend(override.aes =
list(color = c("blue","magenta")) ))
## with this...
pl8b + scale_linetype_manual(name="Native Language",values=c(1,4))+
scale_shape_manual(name="Native Language", values=c(1,18))+
guides(linetype = guide_legend(override.aes =
list(color = c("blue","magenta")) ))
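The same comparison in a stripped-down, self-contained form, in case you want to play with it outside the workbook. The toy data are simulated (made-up values); the idea being illustrated is that ggplot merges legends for the same variable only when their titles match.

```r
library(ggplot2)

# Simulated stand-in for lexdec (made-up values)
set.seed(1)
toy <- data.frame(
  Frequency      = runif(80, 2, 8),
  NativeLanguage = rep(c("English", "Other"), each = 40)
)
toy$RT <- 6.8 - 0.05 * toy$Frequency + rnorm(80, sd = 0.2)

base <- ggplot(toy, aes(y = RT, x = Frequency,
                        shape = NativeLanguage, linetype = NativeLanguage)) +
  geom_point(alpha = 0.5) +
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE, color = "black")

# Only the line type legend is renamed: two separate legends
split_legends <- base +
  scale_linetype_manual(name = "Native Language", values = c(1, 4))

# Both legends share a name: one combined legend
one_legend <- split_legends +
  scale_shape_manual(name = "Native Language", values = c(1, 18))

split_legends
one_legend
```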
You can also knit to PDF (if you have LaTeX) or to Word (if you have it). Use the down-arrow by the knit button.
Make some plots using the tools we have developed today.
Option 1: Pick a new variable in the lexdec dataset. Use the tools we built today to understand what it does. Run the code ?lexdec to get more info!
Figure out how to map it to an aes like color (for discrete or continuous factors), facet, shape or linetype (for discrete factors), x or y axis (for predictors vs dependent measures), fill (in a heat map), or size (in a bubble plot).
Try different combinations of aes to figure out what the relationship between the factors is.
Option 2: Pick a new data set, and run one of the same plots we created above with the new variables. These are some interesting data sets you have in active R libraries. Run the code ?data_set_name to get more info, where the options below are data_set_name
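If you are not sure which data sets you have, data() can list them. A small sketch using the datasets package that ships with R:

```r
# List the data sets in one package; the result holds a matrix called results
available <- data(package = "datasets")$results
head(available[, c("Item", "Title")])

# data(package = "languageR") does the same for languageR, if it is installed
```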
We will cover the distribution plots: bar plots + error bars, violin plots, beeswarms, and ‘ridges’ (joyplots). Plus, anything else you guys want covered.
Send email to laubre@mpi.nl if you have requests for a plot type, or a specific type of data to cover.